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ABSTRACT 

Cosmological simulations are the key tool for investigating the different processes involved in the 
formation of the universe from small initial density perturbations to galaxies and clusters of galaxies 
observed today. The identification and analysis of bound objects, halos, is one of the most important 
steps in drawing useful physical information from simulations. In the advent of larger and larger 
simulations, a reliable and parallel halo finder, able to cope with the ever-increasing data files, is 
a must. In this work we present the freely available MPI parallel halo finder Ahf. We provide 
a description of the algorithm and the strategy followed to handle large simulation data. We also 
describe the parameters a user may choose in order to influence the process of halo finding, as well 
as pointing out which parameters are crucial to ensure untainted results from the parallel approach. 
Furthermore, we demonstrate the ability of Ahf to scale to high resolution simulations. 
Subject headings: methods: numerical 



1. INTRODUCTION 

The identification and hierarchical grouping of 
'clumps' within large sets of particles (may it be dark 
matter, star or gas particles) produced by cosmological 
simulation codes is the objective in halo-finding. A vari- 
ety of methods have been developed and have seen many 
improvements over the years, however, at heart, all meth- 
ods try to identify peaks in the density field and group 
particles around those peaks. 

The classical method to identify isolated structures is 
the purely geometrical 'Frien ds-of- Friends' (Fof) algo- 
rithm (e.g. iDavis et al.l Il985f ) in which particles closer 
than a given scale (the linking length) are connected. 
The whole particle distribution then separates into iso- 
lated regions where outside particles do not come closer 
than the linking length. A serious shortcoming of 
this method is the danger of linking two blobs to- 
gether via a 'linking bridge'. Additionally, with a fixed 
linking length it is impossible to identify substructure 
within a Fof halo. Many variants of this method have 
been developed, trying to overcome the short-comings 
of the classical algorithm, either by u s ing adaptive 
linking lengths (ISuginohara fc Sutol 119921 : Ivan Ramped 
Il995t lOkamoto fc Habelll999h , multiple linking lengths 
( Klvpin et al.lll999j hierac hical Fof) or the inclusion of 
earlier snapshot ana l yses (ICouchman fc Carlbcrg 119921 : 
ISummers et~aTlll995t Klvpin et al.lll999fh 

Most other halo finding methods employ an explicit 
measure of the density field. One of the first meth- 
ods to d o so is the Spherical Oyerdensity (So) al- 
gorithm ([Press fc Schechterl 11974 IWarren et all 11992b 
lLacev fc Cold I1994T ) whichcalculates the density from 
the distance to the Nth nearest neighbour. It then 
searches for density peaks and grows spheres about them 
until a certain overdensity threshold is reached iteratively 
adapting the center to the center of the enclosed parti- 



cles. 

The Denmax algorithm (jBertschinger fc Geibl 119911 : 
iGelb fc Bertschingen[l99l uses a grid to calculate the 
density field and then moves the particles on a path given 
by the density gradient until a local density maxima 
is reached. This artificially increases the separation of 
clumps and circumvents the linking bridges, so that then 
a Fof approach can be used to coll ect halo particles. 
Similar in spirit is the Hop algorithm (jEisenstein fc Hu"D 
1998) which employs a different way to move the particles 
to a density maxima, avoiding the calculation of a den- 
sity gradient by instead 'hopping' to a neighbouring par- 
ticle associated with the h ighest densi t y. The offspring 
of De nmax, Skid (see e.g.lStadefeOOltlGovernato et al. 
| 1997t IWeinberg etaTI Il997t Uang-Condell fc Hernquist 
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120011 ) uses a Lagrangian density estimator similar to the 
one employed by the So al gorithm. 

The Bdm me thod (jKlypin fc Holtzmanl 119971 : 
iKlvnin et al.l Il999h uses randomly placed spheres 
with predefined radius which are iteratively moved to 
the center of mass of the particles contained in them 
until the density center is found. This iteration process 
is also used in t he So method. Recently, the Voboz 
(jNevrinck et al.l l2005f ) technique has been described, 
which uses a Voronoi tessellation t o calculate the local 
density. The SU BFIND algorithm (jSpringel et all l200ll : 
iDolag et alJ l2008) uses in a first step a Fof method and 
then looks for saddle points in the density field within 
e ach Fof group. 

iGill et ail (|2004l ) used the grid hierarchy generated by 
an ad aptive mesh refinement (AMR) code (|Knebe et al.l 
120011 Mlapm) to construct a halo finder, Mhf. The 
grid hierarchy is built in such a way that the grid is 
refined in high density regions and hence naturally traces 
density contours. This can be used to not only select 
halos, but also to identify included substructure. The 
AMR grid structure naturally defines the halo-subhalo 
hierarchy on all levels, which is a mandatory requirement 
for any state-of-the-art halo finder. 

One further very important aspect in finding substruc- 
ture is the pruning of their associated particle lists to 
remove unbound particles to get a dynamical description 
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of the halo. All current halo finding tools (Skid, Bdm, 
Subfind, Voboz, Mhf) perform this step, however the 
degree to which the potential field is reconstructed to 
define the binding energy varies. Usually the halo is 
treated in isolation and only recently methods have been 
proposed to handle the inclusion of tidal effects to de- 
fine the demarcation of subhalos (e.g . IWeller et al.ll2005t 
iKim & Parlj[200l iShaw et alJl200l . 

In this work we will describe the successor of Mhf 
named Ahf (the Amiga Halo Finder). Amiga aims to 
be the re placement for the c osmological simulation code 
Mlapm (jKnebe et~aT1l2001h and is capable of doing the 
halo analysis during the course of the simulation. How- 
ever, Ahf can also be used as a stand-alone halo finder 
and as such it is described in this paper. It features new 
reading routines which can handle large simulation data 
in parallel and enhanced features, although the principle 
algorithmic idea of Mhf is kept. 

The paper is organized as follows. S}2] introduces the 
method used to identify halos in cosmological simulations 
and describes the parallel strategy. In i]5]we apply Ahf 
and show the impact of the free parameters and verify 
that the parallel approach does not introduce artefacts. 
We then compare Ahf to other halo finders and to the- 
oretical predictions in Sj4] $5] shows the scalability and 
stability of our results. We conclude and summarize in 



2. HALO FINDING THE Ahf WAY 

In this section we describe our algorithm for find- 
ing structures in cosmological simulations. Ahf is the 
successor of Mhf. For a more detailed description of 
the under l ying principles of Mh f we refer the reader to 
iGill et alJ fpOOl andE^l (gQll; for reference, especially 
on the configuration and compilation, the user manual 
available on the Ahf website 3 is also helpful. In this 
paper we focus on the parallel implementation and also 
provide a study of the parameters of the algorithm. How- 
ever, we will give a short description of the main ideas 
underlying Ahf in §2. H and describe the parallelizing ap- 
proach in §2.21 Finally we summarize the parameters in 



2.1. Ahf 

In Ahf we start by covering the whole simulation 
box with a regular grid of a user-supplied size (the do- 
main grid, described by the DomGrid parameter). In 
each cell the particle density is calculated by means 
of a triangular shaped clou d (TSC) weighing scheme 
(|Hocknev fc Eastwoodl ^^S). If the particle density 4 ex- 
ceeds a given threshold (the refinement criterion on the 
domain grid, DomRef), the cell will be refined and cov- 
ered with a finer grid with half the cell size. On the finer 
grid (where it exists), the particle density is recalculated 
in every cell and then each cell exceeding another given 
threshold (the refinement criterion on refined grids, Re- 
fRef) is refined again. This is repeated until a grid is 
reached on which no further cell needs refinement. 

Note that the use of two different refinement criteria 
— one for the domain grid and one for the refinements 



3 http : //www . aip . de/People/AKnebe/AMIGA 

4 Please note that the particle density is used, not the mass 



density. 



— has historical reasons routed in the requirement that 
for a cosmological simulation the interparticle separation 
of the initial particle distribution should be resolved and 
hence DomGrid = 2N, with N 3 being the total number of 
particles. However, this might lead to memory problems 
and hence the ability to choose a coarser domain grid 
but to refine it more aggressively was provided. For halo 
finding the choice of the domain grid is rather arbitrary. 
We will test for the influence of these parameters later 
on. 

Following the procedure outlined above yields a grid 
hierarchy constructed in such a way that it traces the 
density field and can then be used to find structures in 
cosmological simulations: Starting on the finest grid, iso- 
lated regions are identified and marked as possible halos. 
On the next coarser grid again isolated refinements are 
searched and marked, but now also the identified regions 
from the finer grid are linked to their corresponding vol- 
ume in the coarser grid. Note that by construction the 
volume covered by a fine grid is a subset of the volume 
covered by the coarser grids. 

By following this procedure, a tree of nested grids is 
constructed and we follow the halos from 'inside-out', 
stepping in density contour level from very high densities 
to the background density. Of course it can happen that 
two patches which are isolated on one level link into the 
same patch on the next coarser grid in which case the 
two branches of the grid tree join. The situation after 
this step is depicted in the upper row of figure [TJ 

In a later step, once the grid forest is constructed, the 
classification of substructure can be made. To do this, we 
process each tree starting from the coarsest level down- 
wards to the finer levels, the procedure is also illustrated 
in figure Q] Once the finer level splits up into two or more 
isolated patches, a decision needs to be made where the 
main branch continues. This is done by counting the par- 
ticles contained within each of the isolated fine patches 
and we choose the one containing the most as the main 
branch, whereas the others are marked as substructures 5 . 
Note that this procedure is recursive and also applies to 
the detection of sub-sub-structure. 

Assuming now that the leaf of each branch of the grid 
tree corresponds to a halo, we start collecting particles by 
first assigning all particles on the corresponding isolated 
grids to this center. Once two halos 'merge' on a coarser 
level, we tentatively assign all particles within a sphere 
given by half the distance to the host halo (here, the no- 
tation of substructure generated by the construction of 
the grid tree enters) . We then consider the halo in isola- 
tion and iteratively remove unbound particles, the details 
of this process are described in appendix [Al Particles not 
bound to a subhalo will be considered for boundness to 
the host halo. By doing this for all prospective centers, 
we construct a list of halos with their respective particles; 
note that subhalos are included in their parent halo by 
default, but there is, however, an option to not include 
them. 

The extent of the haloes is defined such that the virial 
radius is given by 

p(r vir ) = A vir (z)p b (1) 
where p{r) denotes the overdensity within r and pb is the 

5 There are also different criteria available and described in detail 
in the user manual to make this decision. 
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Fig. 1. — Here we illustrate the process of classifying a grid 
tree into substructures. In the top row, an arbitrary sample grid 
structure is shown on the left and the corresponding grid tree on the 
right. The classification starts by labeling the coarsest grid as part 
of the host and then proceeds to the next level. This is depicted 
in the second row: Five isolated grids are embedded and the one 
containing the most particles is marked as the 'host'. The other 
four grids are then marked as subhalos. This process is repeated 
for the next levels, always deciding for each isolated grid whether 
is it the start of a new substructure or part of the parent structure. 

background density. Hence the mass of the halo becomes 

(2) 



M vir = 47rp 6 A vir (*)rS ir /3 



Note that the virial overdensity A v i r depends on the red- 
shift and the given cosmology, it is however also possible 
to choose a fixed A. This definition for the extent of 
a halo does not necessarily hold for subhalos, as it can 
happen that the overdensity never drops below the given 
threshold. The subhalo is therefore truncated at higher 



overdensities, given by a rise in the density profile. 

We would like to note that host halos (or in other 
words field halos) initially include all particles out to 
the first isodensity contour that fulfills the criterion 
Piso < A v i r (z)pb. Then the same procedure as outlined 
above is applied, i.e. the halo is considered in isolation 
and unbound particles are iteratively removed. 

The (bound) particle lists will then be used to calcu- 
late canonical properties like the density profile, rotation 
curve, mass, spin, and many more. After the whole halo 
catalog with all properties has been constructed, we pro- 
duce three output files: The first one contains the inte- 
gral properties of each halo, the second holds for each 
halo radially binned information and the third provides 
for each halo a list of the IDs of all particles associated 
with this halo. 

As this algorithm is based on particles, we natively 
support multi-mass simulations (dark matter particles 
as well as stars) and also SPH gas simulations. For Ahf 
they all are 'just' particles and we can find halos based 
on their distribution, however, for the gas particles of 
SPH simulations we also consider their thermal energy 
in the unbinding procedure. Even though, for instance, 
the stellar component is much more compact than the 
DM part, this does not pose a challenge to Ahf due to 
its AMR nature: the grid structure is generated based 
on particle densities rather than matter densities that 
are obviously dominated by dark matter particles. 

To summarize, the serial version of Ahf exhibits, in 
principle, only few parameters influencing the halo find- 
ing. The user has three choices to make, namely the size 
of the domain grid (DomGrid), the refinement criterion 
on the domain grid (DomRef) and the refinement crite- 
rion on the refined grids (RefRef) need to be specified. 
While all these three parameters are mainly of a techni- 
cal nature they nevertheless influence the final halo cat- 
alogues obtained with Ahf. The code utilizes isodensity 
contours to locate halo centres as well as the outer edges 
of the objects. Therefore, DomGrid along with DomRef 
sets the first such isodensity level whereas RefRef deter- 
mines how finely the subsequent isodensity contours will 
be spaced: if RefRef is set too large an object may not 
be picked up correctly as Ahf samples the density con- 
tours too coarsely. We refer the reader to a more indepth 
study of the influence of these parameters on the results 
in Section [3l 

2.2. Parallel Approach 

We will now describe the approach taken to parallelize 
the halo-finding. In §2.2.11 we describe the way the vol- 
ume is decomposed into smaller chunks and we then elab- 
orate in ^2.2.21 how the load-balancing is achieved before 
ending in §2.2.31 with some cautionary notes on the ap- 
plicability of this scheme. 

2.2.1. Volume Decomposition 

There are many ways to distribute the workload to 
multiple processes. In some halo finders this is done by 
first generating particle groups with a Fof finder which 
can then independently processed (e.g. iKim &: Parkl 
2006). However, we chose to 'blindly' split the whole 
computational volume to multiple CPUs. This is done 
because we plan to implement Ahf as part of the Simula- 
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tion code Amiga and not only as a stand-alone version 6 ; 
the same as Mhf is also integrated into Mlapm and can 
perform the halo finding 'on the fly' during the runtime 
of a simulation. For the simulation code, a proper vol- 
ume decomposition scheme is required and hence we are 
using the approach described here. However, it should 
be noted that, in principle, Ahf is capable of working on 
any user defined sub-set of a simulation box and as such 
is not tied to the employed decomposition scheme. 

The volume decomposition is done by using a space- 
filling curve (SFC), which maps the three spatial coordi- 
nates (x, y, z) into a one-dimensional SFC-index i; an 
illustration is given in the left panel of figure BJ As 
many other worker s have done before fe.g. lSpringelll2005t 
iPrunet et afl feOOS) we use the Hilbert-curve. A new pa- 
rameter, LB, of Ahf regulates on which level this or- 

I R 

dering is done; we then use 2 cells per dimension; 
it is important to remark that the grid used for load- 
balancing is distinct from the domain grid and therefore 
an additional parameter. Each process will then receive 
a segment, described by a consecutive range of indices 
istart ■ ■ • «stop, of the SFC curve and hence a sub- volume 
of the whole computational box. The Hilbert curve has 
the useful property to preserve locality and also to min- 
imize the surface area. It is in this respect superior to a 
simple slab decomposition, however at the cost of volume 
chunks with a complicated shape. 

In addition to its segment of the SFC curve, each pro- 
cess will also receive copies of the particles in a buffer 
zone around its volume with the thickness of one cell of 
the LB-grid (cf. the right panel of figure O . With these 
particles, each CPU can then follow the standard recipe 
described above f £|2.ip in finding halos. The thickness 
of the boundary is hence an important quantity that is 
supposed to counteract the situation in which one halo is 
situated close to the boundary of two processes. As there 
is no further communication between the tasks, for rea- 
sons we will allude to below, the buffer zones need to be 
large enough to encompass all relevant particles. To ful- 
fill this requirement, the thickness of the boundary zone, 
given by the LB parameter, should obey the relation 



< 



B 



Dmax 
^vir 



(3) 



where B is the size of the simulation box and i?"]j ax is the 
radius of the largest objects of interest. Note that each 
process only keeps those halos whose centers are located 
in the proper volume (as given by the SFC segment) of 
the process. 

2.2.2. Load-balancing 

To subdivide the SFC curve, different schemes can be 
employed, influencing the quality of the achieved load- 
balancing; this is ultimately measured by the difference 
of the run times of the different processes. We chose 
to use a scheme that distributes the particles evenly be- 
tween the processes, e.g. each CPU will require the same 
amount of storage for the particles. This can of course 
lead to vastly different volumes assigned to each task 
(cf. figure [2]) , but we will show in i )5.2l that this simple 
scheme can provide reasonable results. 

6 Note that for the serial version this is already the case. 
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Fig. 2. — A sample volume decomposition is shown in the left 
panel, the total volume is divided into four segments along a Hilbert 
curve (dashed line) on the load-balance grid. Here we choose LB = 
3 and only show a two dimensional sketch for the sake of clarity. 
The right panel shows for one selected segment, that is assigned to 
one MPI process, which additional boundary volume elements will 
be copied to the task. 

We therefore segment the SFC curve into chunks con- 
taining the same (within the precision allowed by the 
coarseness of the load balance grid given by LB) number 
of particles. As with this scheme (or any scheme based 
on segmenting the volume along a SFC curve) it can hap- 
pen that objects are cut between different processes, the 
grid tree construction would be very communication in- 
tensive. To circumvent this, we use the inclusion of a 
buffer zone alluded to above. 

Note that after the duplication of boundary particles 
the balance of particles might shift, which is especially 
prominent for very inhomogeneous particle distributions, 
as found, for example, in simulations with a small box 
size. Since the decision of how to segment the SFC 
curve is modular, it is very easy to implement a different 
scheme, we plan on investigating into further criteria in 
future work. 

2.2.3. Caveats 

As we described above, we require that a single halo is 
processed by one task, we do not yet allow for parallelism 
on a halo level. Hence we introduced the duplication of 
particles, which can lead to an unfortunate distribution 
of the work-load: Large halos may require more time to 
be processed than many smaller halos. To counteract 
this, we provide the option of dividing the halo finding 
into a splitting and an analysing step. Only the splitting 
step needs to be performed by a parallel task in which 
each process will dump its particles to a restart file. All 
those can then be analysed independently. 

This becomes important for large simulations, contain- 
ing hundreds of million of particles. On smaller simula- 
tions (up to 512 3 ) we find that the runtime unbalance is 
not very significant in absolute terms and a direct par- 
allel approach is very well feasible, we will discuss this 
further in 



2.3. The parallel parameters of Ahf 

To summarize, besides of the three already existing pa- 
rameters (cf. Section I2.1| the user of the MPI-enabled 
version of Ahf has to choose two additional parameters: 
the number of CPUs will influence the total runtime and 
is the parameter that can be adapted to the available re- 
sources. Very crucial, however, is the size of the bound- 
ary zones, given by LB, which has a significant influence 
on the reliability of the derived results and must be cho- 
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TABLE 1 
Summary of the simulation 
parameters. 



TABLE 2 

Summary of the analyses parameters of B20, B50 and 
B1500. 





Name 


Blh' 1 Mpc] 


^Vpart 


Set 1 


B20 


20 


256 3 




B50 


50 


256 3 




B1500 


1500 


256 3 


Set 2 


B9361o 


936 


256 3 




B936me 


936 


512 3 




B936hi 


936 


1024 3 



sen carefully (cf. equation [3]). 

All of the Ahf parameters are of a more technical na- 
ture and hence should not influence the final halo cata- 
logue. However, we like to caution the reader that inap- 
propriate choices can in fact lead to unphysical results. 
The following Section [3] therefore aims at quantifying 
these influences and deriving the best-choice values for 
them. 

3. TESTING 

In this section we are discussing the impact of the free 
parameters of Ahf on the halo finding. In §3.11 we first 
introduce the ensemble of simulations we used and de- 
scribe the tests. To gauge the influence of numerical 
effects, we perform one additional test in §3.2[ before 
systematically varying the parameters in § 3 . 3tf375l 

3.1. Performed tests 

We use two different sets of simulations, summarized 
in table [TJ To investigate the impact of the free pa- 
rameters of the algorithm and to produce comparative 
figures of the serial version to the parallel version, we 
employ the simulations containing 256 3 particles con- 
tained in set 1. The boxes have different sizes, namely 
20 hr 1 Mpc, 50 hr 1 Mpc and 1.5 hr 1 Gpc. The first simu- 
lation be longs to the set o f simu lations described in more 
detail in iKnebe fc Power! ([2008) . The particulars of the 
second simulation are described elsewhere (Power et al. 
in pre p.) and the last one is described in IWagner et al.1 
(2008). Note that with this particle resolution we can 
still use the serial version to produce halo catalogs. An 
additional set of simulations is later used (cf. M5 . 3[) to in- 
vestigate the scaling behaviour of Ahf with the number 
of particles. The three simulations of the 936 /i -1 Mpc 
box forming set 2 were provided to us by Christian Wag- 
ner. 

For a successful run of the parallel version of Ahf, five 
parameters need to be chosen correctly, each potentially 
affecting the performance and reliability of the results. 
These are: 

• DomGrid: the size of the domain grid (cf. ^3.3p 

• DomRef: the refinement criterion on the domain 
grid (cf. 3331) 

• RefRef : the refinement criterion on the refined grid 
(cf. MM 

• LB: the size of boundary zones (cf. fc|3.5|) 

• CPU: the number of MPI-processes used for the 
analysis (cf. £|3.5[) 



Name" 


DomGrid 


CPU° 


LB C 


DomRef 


RefRef 


064-01-5-5 0-5 


64 


\ 


5 


5.0 


5.0 


064-01-5-1 0-5 


64 


I 


5 


1.0 


5.0 


1 28-01-5-5 0-5 


128 


I 


5 


5.0 


5.0 


1 98 01 U fl fid 


128 


I 


5 


1 


5 


1 98 01 ^ 1 A 


128 


I 


5 


1 


4 


1 28-01-5-1 0-4 


128 


x 


5 


1.0 


3.0 


1 28-01-5-1 0-4 


128 


I 


5 


1.0 


2.0 


1 28-02-4-5 0-5 


128 


2 


4 


5.0 


5.0 


1 98 09 ^ ^ ^ 


128 


■) 


5 


5 


5 


1 98 09 ^ ^ 


128 


2 


(] 


5 


5 


128-02-7-5.0-5.0 


128 


2 


7 


•5.0 


5.0 


128-04-4-5.0-5.0 


128 


4 


i 


5.0 


5.0 


128-04-5-5.0-5.0 


128 


4 


5 


5.0 


5.0 


128-04-6-5.0-5.0 


128 


4 


6 


5.0 


5.0 


128-04-7-5.0-5.0 


128 


4 


7 


5.0 


5.0 


128-08-4-5.0-5.0 


128 


8 


4 


5.0 


5.0 


128-08-5-5.0-5.0 


128 


8 


5 


5.0 


5.0 


128-08-6-5.0-5.0 


128 


8 


6 


5.0 


5.0 


128-08-7-5.0-5.0 


128 


8 


7 


5.0 


5.0 


128-16-4-5.0-5.0 


128 


16 


4 


5.0 


5.0 


128-16-5-5.0-5.0 


128 


16 


5 


5.0 


5.0 


128-16-6-5.0-5.0 


128 


16 


6 


5.0 


5.0 


128-16-7-5.0-5.0 


128 


16 


7 


5.0 


5.0 



a The name is constructed from the parameters in the follow- 
ing way: DomGrid-CPU-l_B-DomRef-RefRef. b Analyses employ- 
ing one CPU have been performed on an Opteron running at 
1.8GHz, whereas the analyses utilizing more than one CPU were 
performed on Xeons running at 3GHz. c Note that this parame- 
ter has no effect for runs with only one CPU. 

The latter two apply for parallel runs only. We system- 
atically varied those parameters and analyzed the three 
simulations of set 1 described above. In tabled we sum- 
marize the performed analyses. 

From each analysis, we will construct the mass- 
function N(AM) in logarithmic mass bins and the spin 
parameter distribut ion 7V(AA), w here the spin parame- 
ter is calculated as ([Peebles! 1 19691 ) 



A = 



(4) 



GM 5 / 2 

where J is the magnitude of the angular momentum of 
material within the virial radius, M is the virial mass 
and E the total energy of the system. The spin parame- 
ter A hence combines mass and velocity information and 
therefore allows for a good stability test of the results. 
We will further cross-correlate the halo-catalogs to inves- 
tigate differences in the deduced individual halo proper- 
ties. 

3.2. Numerics 

As the grid hierarchy plays the major role in identify- 
ing halos, we test the sensibility of the derived properties 
on the grid structure by introducing small numerical arti- 
facts which can lead to slightly different refinement struc- 
tures. To do this, we employ only one CPU and perform 
two analyses, one on the original particle distribution 
and one on a shifted (by half a domain grid cell in each 
direction, taking periodicity into account) particle distri- 
bution; we otherwise keep all other parameters constant 
at DomGrid = 64, DomRef = 5.0 and RefRef = 5.0. The 
ratio of the resulting mass-functions and spin-parameter 
distributions are shown in figure OS 

Small deviations can be seen in the mass function, 
which is however of the order of 1 — 3 per cent in a few 
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Fig. 3. — We show a comparison of derived global properties 
for the original particle distribution and a distribution that has 
been shifted by half a domain grid cell (DomGrid = 64) in each 
direction. The ratio of the shifted to the unshifted mass function 
(spin parameter distribution) is shown in the upper (lower) panel. 

bins and then mostly due to one or two haloes changing a 
bin. This translate into some scatter in the spin param- 
eter distribution, however the seemingly large deviation 
on the low- and high-spin end are artificially enhanced 
due to the low number counts in the bins, we also ex- 
cluded halos with less than 100 particles (cf. H3.4|) from 
this plot. The scatter here is also of the order of 1 — 3 
per cent. It is important to keep in mind that the de- 
rived properties can vary to this extent simply due to 
numerical effects. 

However, the main point of this comparison is to ver- 
ify that, up to numerical effects caused by the perturbed 
density field, we do not introduce any systematic devia- 
tion in the statistical properties. 

3.3. Domain Grid 

We now focus on the choice of DomGrid, i.e. the size of 
the domain grid. To this extent, we employ one process 
and vary the domain grid and the refinement criterion 
DomRef on the domain grid. In figure [4] we show the de- 
duced mass-function and spin parameter distribution for 
the four cases. As can be seen, the impact of the domain 
grid choice is negligible; small deviations can be seen at 
the high mass end of the mass function in B50, how- 
ever these are caused by (of order) one halo changing bin 
across runs with varied parameters. We will discuss the 
drop of N(AM) at the low M end of the mass function 
below. 

In view of the parallel strategy the insensitivity of the 
results to the choice of the domain grid is reassuring, 
as we can use a rather coarse domain grid and start 
the refinement hierarchy from there; note that the do- 
main grid will be allocated completely on each CPU and 
hence choosing a fine grid would lead to a large number 
of empty cells in the parallel version. 

It can also be seen that the choice of the refinement cri- 
terion on the domain grid has no impact. This is because 



the domain grid is coarse enough as to justify refinement 
everywhere anyhow. In fact, only very pronounced un- 
derdense regions would not trigger refinements for the 
choices of parameters and particle resolution. 

3.4. Refinement Criterion 

Now we investigate the effect of choosing a different 
refinement criterion (RefRef) for the refined grids. We 
limit ourselves to a domain grid of DomGrid = 128 cells 
per dimension and use a refinement criterion on the do- 
main grid of DomRef =1.0 with one process. We then 
vary the refinement criterion on the refined grids from 
RefRef = 5.0 (this corresponds to analysis already used 
above when investigating the impact of the domain grid 
and its refinement criterion) to RefRef = 2.0 in steps of 
1.0. 

In figures and [H] we show the effect of varying the 
refinement criterion on the refined grids on the mass- 
function and the spin-parameter distribution. It can be 
seen that the mass-function changes mainly at the low 
mass end. The changes are related to the mass discrete- 
ness: The more particles a halo is composed of, the eas- 
ier it is to pick it up with our refinement criterion based 
on the number of particles within a cell. Hence it can 
be readily understood that by forcing a smaller crite- 
rion, more halos with a small amount of particles will be 
found. 

This is important for the completeness of the deduced 
halo catalogs: only when choosing a very small refine- 
ment criterion (< 3.0), we are complete at the low- 
particle-count 7 end. 

The spin parameter distribution is also affected. We 
can see in the top row of figure[S]that the peak of the dis- 
tribution shifts slightly and is reduced in height 8 . How- 
ever, when only including halos with more than 100 par- 
ticles (shown in the bottom panels of figure [5]) in the 
calculation of the distribution, the shift is reduced and 
the distributions coincide; note that the most massive 
halo in the analyses of B1500 is resolved with only 237 
particles. 

As we will discuss later, the choice of the refinement 
criterion severely impacts on the runtime and the mem- 
ory requirements (sec figure [T4"]) and hence should be low- 
ered only with care. 

3.5. The boundary zone and the number of processes 

We will now investigate the effect of the size of the 
boundary zone and the number of processes. As the ref- 
erence model we use a serial run with a domain grid of 
DomGrid = 128 cells per dimension, a refinement crite- 
rion of DomRef = 1.0 on the domain grid and a refine- 
ment criterion of RefRef = 5.0 on the refinements; note 
that the LB parameter has no effect in the case of a serial 
run. We also change the number of processes involved in 
the analysis from 1 (the reference analysis for each box) 
in factors of 2 to 16, in which case the LB parameter 
has a very significant meaning: Recall that the volume 

7 This translates in general into a low mass halo, however, the 
important quantity is really the number of particles, not their mass. 
In zoomed simulations, low-mass halos in the zoomed region might 
be found completely whereas halos of the same mass in the low 
resolution regions are not picked up. 

8 The latter is due to the fact that the distribution is not nor- 
malized. 
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Fig. 5. — The effect of the refinement criterion on the refined grids on the mass function. We show the mass functions derived for the 
given analyses in the top and also deviations arbitrarily normalized to the 128-01-1.0-4.0 in the smaller bottom panels. For guidance, the 
vertical lines indicate the mass corresponding to halos with 50 and 100 particles, respectively. The columns are organized as in figure [4] 
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decomposition scheme not only assigns to each CPU the 
associated unique volume, but also a copy of the bound- 
ary layer with a thickness of 1 cell. We increase the size 
of the decomposition grid from a2 4 = 16(LB = 4) cells 
per dimension grid by factors of two up to a 2 7 = 128 
(LB = 7) cells per dimension grid. To correctly identify 
a halo located very close to a boundary, the buffer vol- 
ume must be large enough to encompass all its particles, 
namely the thickness of the boundary should be R™]f x 
(cf. equation [3} . This condition is violated for B20 from 
LB > 4 on and for B50 for LB > 5, whereas in B1500 
it would only be violated for LB > 8. Hence we expect 
to see differences in the derived properties due to the 
volume splitting for all analyses violating this condition. 

3.5.1. Statistical comparison 

We first concentrate on integrated properties, namely 
the mass functions and the spin parameter distributions 
again. In figure[7]we show the ratio of the mass functions, 
whereas in figure [8] the ratio of the spin parameter distri- 
bution is depicted. In all cases we have taken the serial 
run 128-01-5-5.0-5.0 as the reference and show relative 
deviations from it. In each figure the number of em- 
ployed processes increases from top to bottom from 2 to 
16, whereas each single plot shows the derived mass func- 
tion (spin parameter distribution) for the four choices of 
the boundary size. 

It can be seen, that the integrated properties appear 
to be in general rather unaffected by the choice of the 
number of CPUs or the LB level. For B20, the mass 
function only shows a difference for the LB7 analysis, the 
B50 shows differences at the high mass end for LB6 and 
LB7 and more than 4 CPUs. Looking at the spin param- 
eter distribution, only B20 shows a difference at the large 
A end for LB6 and LB7. B1500 is completely unaffected 
by any choice of tested parameters. As we have alluded 
to above, this is expected and we can clearly see two ef- 
fects coming into play: First, the smaller the boundary 
zone is, the higher the probability that a halo might not 
be available entirely to the analysing task. Second, in- 
creasing the number of CPUs simultaneously increases 
the amount of boundary cells and hence the chance to 
provoke the aforementioned effect. 

However, it is interesting to note that in a statistical 
view we really have to go to the worst case choices (large 
LB, many CPUs), to get a significant effect. Also we 
should note that the shown deviations are relative and 
are hence more pronounced in the bins with low counts, 
namely the outskirts of the spin parameter distribution 
and the high mass end of the mass function. Choosing 
the correct LB value, we can see that the integrated prop- 
erties to not depend on the number of CPUs involved. 

3.5.2. Halo-halo cross comparison 

We will now investigate the dependence of the individ- 
ual halo properties on the parallelizing parameters. To 
this extent we perform a cross-correlation of particles in 
identified objects between the different parallel runs and 
- taken as the reference - a serial run. 

We employ a tool included in the AHF-distribution 
which establishes for two given halo catalogs — for this 
purpose consisting of the particle IDs — for each halo 
from the first catalog the best matching halo from the 



second catalog. This is done by maximizing the quantity 



where iV s hared is number of shared particles between the 
two halos and Ni and N2 are the total number of par- 
ticles of the halo from the first and the second catalog, 
respectively. 

In the ideal case, we expect every particle found in a 
halo in the reference analysis to also appear in the corre- 
sponding halo in the parallel version. Due to the choice 
of the boundary size, this might not be the CctSG, clS for 
halos located close to a boundary the required particles 
might not be stored in the boundary. Also numerical 
effects from a different grid structure in the boundary 
can lead to slight variations in the identified halos, as we 
have demonstrated already in figure [3] Additionally we 
should note that the catalogs produced by two runs with 
different number of CPUs cannot be compared line by 
line, as, even though the ordering is by number of par- 
ticles, halos with the same number of particles are not 
guaranteed to appear in the same order. 

We show this impact for the mass of the halos in fig- 
ure [9] and for the spin parameter in figure [TO] Note that 
we only plot data points for halos that have different 
properties. In total we have 16631 (19621, 4971) objects 
in B20 (B50, B1500). 

It is interesting to note that even though the distribu- 
tion of integrated properties (cf. §3.5.1| show no signif- 
icant effect, we do find slightly different results for indi- 
vidual halos. However, the observed differences in B20 
and B50 for LB6 and LB7 arc to be expected, as in these 
cases the boundary zones are to thin to cope with the ex- 
tent of the halos, as we have alluded to above. For B1500 
we would need to increase the LB level to a significantly 
larger number to provoke the same effects. Addition- 
ally the differences found are generally of the order of a 
few particles as indicated by the dashed lines in figure [H 
which correspond to a difference of 5 particles. We have 
observed in i j3.2l that numerical noise — as introduced 
by shifting the particle positions — already can induce 
fluctuations of the order of a few particles (cf. figure [3]). 

Generally, however, we can say that, choosing a sane 
LB level complying with equation [3J the parallel version 
gives the same result as the serial version. 

4. COMPARISON TO OTHER HALO FINDERS AND 
THEORTICAL PREDICTIONS 

In this section we compare the results of Ahf to 
three other halo finding mechanisms. We restrict our- 
selves to the analy sis of B2 and use a F of halo finder 
(jDavis et all I1985T ). Skid (IStadell 1200 ll). and an im- 
plementation of the Bdm (jKlypin fc Holtzmanl Il997f) 
method. 

For the Fof run, we use a linking length of 0.17, which 
yields an overdensity at the outer radius comparable to 
the virial overdensity used in Ahf. In the Bdm analysis, 
in order to find (potential) halo centers we used a density 
field smoothed on a scale of 10/i _1 kpc, approximately 
corresponding to a mass scale of 10 9 /i _1 Mpc, and we 
therefore expect the mass function to not be complete on 
this scale. Otherwise we use a variant of the original Bdm 
scheme to iden tify the final halos a nd their properties (cf. 
Appendix B in lBullock et al.ll2001f ). For the S KID run, we 
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Fig. 7. — We show the massfunctions derived for our three simulations varying the number of processes and size of the domain 
decomposition grid. 
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Fig. 9. — Mass ratio of halos in the parallel version using 2 (4, 8, 16) processes from top to bottom in the three boxes. Open boxes 
(circles, upright triangles, downright triangles) are used to indicate LB4 (5, 6, 7). The additional curved lines correspond to a difference in 
halo mass of ±5 particles. Note that we only show those halos that have a ratio not equal to unity and we see the expected behaviour that 
the mismatch becomes more promiment in the case of a large LB (cf. discussion in £|3.5,2I 
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Fig. 10. — As figure [9] but for the spin parameter as a function of halo mass. 
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used a linking length of 3e, where e is the force resolution 
of the simulation (e = 1.5/i _1 kpc). 

Besides a direct comparison between these halo finders 
in t j4.ll we will further investigate in subsection I4.2I how 
well theoretical descriptions presented in the literature 
describe the numerically obtained mass functions. To 
this extent, we use the highest resolved simulation of set 
2, B963hi. 

As one of our claims is that Ahf is capable of iden- 
tifying field halos as well as subhalos simultaneously it 
also appears mandatory to compare the derived subhalo 
mass function against the findings of other groups. These 
results will be presented in subsection I4.3I 



4.1. Comparison to other halo finders 

In figure [TT] we show the derived mass functions for the 
four differnt codes in the top panel alongside Poissonian 
error bars for each bin. The lower panel investigates 
the relative deviation of the three additional halo finders 
tested in this section to the results derived from our Ahf 
run. To this end, we calculate 

N AHF (AM) - N X (AM) 

N Abf (AM) 1 1 

where X is either Bdm, Fof or Skid. Hence a positive 
deviation means that the Ahf run found more halos in 
the given bin than the halo finder compared to. 

Please note that with the settings used, we do not con- 
sider the mass function below masses corresponding to 
100 particles to be complete in the Ahf analysis. Also 
the sharp decline of the Bdm function is not due to the 
method but rather to the smoothed density field alluded 
to above. 



Generally, we find the mass functions derived with 
Bdm and Skid to match the results from the Ahf run 
within the error bars quite well, whereas the Fof analysis 
shows a systematic offset. This mismatch between mass 
functions deduced with the Fof method and So based 
m ethods is known and expected (see, e.g ., the discussion 
in iTinker et~al||2008t iLukic et al.l I2009L concerning the 
relation of Fof masses to So masses). 

4.2. Comparison to theoretical mass functions 

The entering of the era of precision cosmology in 
the last couple of years made it possible to derive the 
mass function of gravitationally bound objects in cosmo- 
logical simulations with unprecedented accuracy. This 
led to an emergence o f refined analytical formulae (cf. 
Sheth fc Tormenl Il999t Uenkins et all l2001t iReed et all 
2003L IWarren et alJl2006t ITinker et al.H2008P i and it only 



appears natural to test whether or not our halo finder 
Ahf complies with these prescriptions. To this ex- 
tent we are utilizing our best resolved simulation (i.e. 
B963hi of Set 2, cf. Table [T| and present in fig- 
ure [T2l a comparison against t h e mass functions p r opose d 
(Prei 



bv IPress fe Schechterl (fl97l: iSheth fc Tormenl (Tl 9991: 



I Jenkins et all (l2001h: IReed et all (|2003f k IWarren et all 
(|2006f) and ITinker et all (|2008l ). 

In this respect it must be noted that the proposed the- 
oretical mass functions are calibrated to mass functions 
ut ilizing different halo ove rdensities. The mass function 
of ISheth fc Tormenl (|1999f ) is bas ed on halo cat alogs ap- 
plying an overden sity of A = 178 (|Tor men 1998), and the 
IReed et al.l (|2003ft and|Warren_etLaL| j2QM) functions are 
formu lated for overdensities of A = 200. iJenkins et al.l 
(|2001h provide various calibration and we use their equa- 
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Fig. 11. — The mass function derived with different halo finders 
for B20. The top graph depicts the actual mass function and we 
only show the Poisson errors for the Ahf data for clarity. In the 
lower frame the ratio of the mass functions to the Ahf mass func- 
tion is shown. Additionally, the dashed vertical lines indicate the 
halo masses that correspond to 50 and 100 particles, respectively. 

tion B4 which is calibrated to a ACDM simulation and 
an So-halo finder with an overdensi ty of A = 324 quite 
close to our value of A = 340. The ITinker et all (|2008l ) 
mass function explicitly includes the overdensity as a free 
parameter and as such we use the value of our catalog to 
produce the m ass function. 

We find the ITinker et all (|2008h formula to give an 
excellent description of the o bserved mass function and 
also the iJenkins et al.l (|200lD formula provides a good 
description, albeit overestimating the number of halos at 
intermediate masses, which might be due to the slightly 
wrong overdensity. While the other formulas also provide 
an adequate description for the mass function at inter- 
mediate masses, they overestimate the number of halos 
at the high mass end. As has been eluded to above this 
is to be expected and can be understood by the different 
halo overdensitie s they have been calibrat ed to. How- 
ever, clearly the iPress fc Schechterl ()1974l ) prescription 
does not provide a good desc ription of the halo mass 
function in this case. As the Tin ker et alj (|2008f ) for- 
mula provides an excellent fit to our data, we refer the 
reader to their work for further discussions concerning 
the impact of the different overdensities used to define a 
halo. 

4.3. The subhalo mass function 

As our halo finder simultaneously finds field and sub- 
halos without the need to refine the parameters for the 
latter it appears mandatory to compare the derived sub- 
halo mass function against results in the literature, too. 
To this extent, we identify the substructure of the most 
massive halo in B20 with our cross matching tool (cf. 
£ 13. 5.2() . We restrict ourselves here to only investigat- 
ing this particular halo, as the particle resolution for the 
other available simulations is not sufficient to resolve the 
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Fig. 12. — In this figure, the mass function for B963hi derived 
with Ahf is compared to a variety of theoretical forms. Again, the 
vertical dashed lines correspond to the halo masses of 50 and 100 
particles, respectively. 

subhalo population adequately. Also, we did use the 128- 
01-5-1.0-2.0 analysis to have a high sensitivity to very 
small halos (cf. discussion of Ahf's parameters in sec- 
tion HJ). 



It has be e n fou n d before (e.g. |G 



Helmi et all 120021: |Pe Lucia et al 



ligna et al. 2000; 
20041: iGaoeTaLl 



2004 Ivan den Bosch et al.l l2005l iDiemand et alJ 120071 : 
Giocoli et alJl2008HSpringel et al.ll2008l ) that the subhalo 
mass function can be described with a functional form 
iV sub (> M) oc M~ a , with a = 0.7... 0.9. In figure Ql 
we show the cumulative mass function of the most mas- 
sive halo in B20 and provide - as guide for the eye - two 
power laws with those limiting slopes. Additionally, we 
fitted a power law to the actual -/V S ^ F (> M), yielding 
a slope of a = 0.81. This test confirms the ability of 
Ahf to reproduce previous findings for the subhalo mass 
function and hence underlining its ability to function as 
a universal halo finder. 

5. RESULTS 

In this section we summarize the results obtained from 
the parameter study fi )5.ip and present the requirements 
of memory and time. We believe this to be important 
information for understanding how to maximize the gain 
from using Ahf. In this respect, we specifically investi- 
gate the achieved load-balancing of the parallel version 
( §5.2p . We further present the scaling of Ahf with in- 
creasing problem size in H5.3I 

5.1. Parameter dependence 

As we have shown, the DomGrid and DomRef param- 
eters have a negligible impact on the derived properties 
and given an adequate choice for the LB level, also the 
number of employed CPUs has no impact on the derived 
parameters. However the RefRef parameter can be used 
to force completeness of the mass function down to small 
particle counts. This comes with a price as we show in 
figure [T4l this figure depicts the increase in run time and 
memory requirement relative to a choice of RefRef =5.0 
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Fig. 13. — The cumulative subhalo mass function for the most 
massive halo in B20. This halo is resolved with fs 8 X 10 5 particles 
and contains 221 resolved subhalos. Also we show three power law 
fits, two with a fixed slope of —0.7 and —0.9, respectively, and a 
third with the slope as a free parameter, yielding —0.81. Note that 
this plot is based on the 128-01-5-1.0-2.0 analysis. 
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Fig. 14. — The impact of the refinement criterion on the refined 
grids on the memory and time consumption of Ahf. We normalize 
the 128-01-5-1. 0-X.O runs to the 128-01-5-1.0-5.0 run and show the 
relative time in the upper and the relative memory consumption 
in the lower panel. 

when changing the RefRef parameter. As we have shown 
in figure [5] a RefRef parameter of 3.0 would be required 
to achieve completeness for halos with less than 50 par- 
ticles. However this is a factor of 2 — 3 in runtime and an 
increase in memory requirement by ~ 40 per cent. These 
numbers are derived from a serial run. 

5.2. Loadbalancing 

We will now investigate the quality of the load- 
balancing of Ahf. All parallel analyses of set 1 (cf. ta- 



ble [TJ have been performed on the DAMIANA 9 cluster, 
where each node is equipped with two Dual Core Xeon 
processors running at 3GHz. Due to memory constraint 

- especially for the variation of the RefRef parameter 

— the serial analyses have been performed on one of our 
local machines (Opteron running at 1.8GHz) where we 
were certain to not be bound by memory. For this rea- 
son, the times between the serial and parallel runs are 
not directly comparable. 

In figure[TOJwe show the absolute times of the runs and 
we give the span of run times of the separate processes for 
the parallel analyses. It can be seen that increasing the 
number of CPUs involved in the analysis indeed reduces 
the runtime. However, when increasing the number of 
CPUs a widening of the spread between fastest and the 
slowest task can be observed. This is most pronounced in 
B20 and in particular for the time required to do the halo 
analysis, whereas the time required to generate the grid 
hierarchy exhibits a smaller spread. The large spread can 
also be seen in the memory requirement for the grids, 
shown in the lower row of figure 1151 This is especially 
pronounced for small LB levels. When going to larger 
boxes the spread becomes smaller. 

The behaviour of increasing spread with decreasing 
boxsize can be readily understood as the effect of the 
non-homogeneity of the particle distribution manifesting 
itself. A small box will typically be dominated by one 
big halo, whereas in large boxes there tend to be more 
equally sized objects. As we do not yet allow one single 
halo to be jointly analysed by more than one CPU, the 
analysis of the most massive halo will basically dictate 
the achievable speed-up. Of course, when decreasing the 
boundary size (increasing the LB level), the spread can 
be reduced, however, the results will be tainted as has 
been shown in figures [5] and [TO] 

In figure [TO] we closer investigate the achieved mem- 
ory saving R(n) with the number of MPI-processes n, 
defined as 

R(n) = M(l)/M(n) (7) 

where M(n) of course refers to the amount of memory 
required for one process when using n CPUs. We cor- 
rect the memory information for the contribution of the 
domain grid (for the choices of parameters it consumes 
64MB); note that this correction is not done in figure [TBI 
presented above. We chose to do this here to better vi- 
sualize the actual gain in the parallel version, which is 
close to what is expected, when the relative contribution 
of the domain grid to the memory usage is small. That 
is not quite the case in our test simulations, however this 
is true for large simulations. 

Note that linear scaling cannot be expected due to the 
duplicated boundary zones. We show this in figure [TO] 
for the ideal case memory reduction expected including 
the boundary zones (solid gray line; for the case of LB4). 
We can estimate this by comparing the amount of cells 
each process is assigned to as 

R(n,N)=(n-^ + ^\ ' (8) 
I R 

where N = 2 is the number of cells in one dimension 
on the LB-grid and n is the number of CPUs. Note that 

9 For more details on the topology of the cluster, see here: 
http : //supercomputers . aei .mpg.de/damiana/ 
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Fig. 15. — We show the absolute times required for the different analyses in the top row and the total memory requirement for the grid 
hierarchy and the particles in the lower row. The columns are ordered as described in figure [4] The top row gives the duration for the 
whole run, whereas the two middle rows give the times needed for the grid generation and the actual halo analysis. The last row shows 
the amount of memory required to store the particle information and the grid hierarchy. We use the 128-01-5-5.0-5.0 run for producing the 
data point for 1 CPU and plot for the parallel runs the maximal and minimal values. Each panel shows the results for different choices of 
LB and for clarity, the curves are offset to one another by factors two. Note that runs with one CPU were, due to memory restrictions, 
performed on a different, slower, machine than the parallel runs, hence the kink in the timing curves at 2 CPUs is not real. However the 
memory curves are not affected by this. 

for this estimate it is assumed that the amount of work 
that needs to be done per cell is same for all cells and in 
such this is only an ideal case estimate. We can see that 
this curve is tracked closely in larger boxes when the 
distribution of particles is more uniform than in small 
boxes. 

It is important to keep in mind that this holds for sim- 
ulations with 256 3 particles. In fact, the analysis could 
still be performed in reasonable time (~ 30 minutes, de- 
pending on the box size) and with modest memory re- 
quirements (~ 2 — 3GB) on a single CPU. However a 
single CPU approach would be unfeasible already for a 
512 3 simulation and completely out of the question for 
larger simulations. 



5.3. Scalability 

So far we have investigated the scaling behaviour for 
simulations with a fixed particle resolution. It is now in- 
teresting to ask how the algorithm scales when increas- 
ing the number of particles in the simulation. To this 
extent we use the simulations of set 2 (see table Q] for 
the particulars) which were performed on the same ini- 
tial conditions represented with three different particle 
resolutions. All analyses have been performed on hlrbii 
at the LRZ Munchen. 

According to the results derived in $3l we used the 



analysis parameters as DomRef = 64, DomRef = 5.0, 
RefRef = 5.0 and LB = 7. We used the option to di- 
vide the run of Ahf into a splitting and an analysing 
step (cf. g2X3]) and used 4 (16, 64) CPUs for the split- 
ting of B9361o (B936me, B936hi). For the subsequent 
analysing step, we only changed the number of CPUs for 
B936hi from 64 to 12. 10 In table [3] we present the full 
timing information reported by the queueing system for 
the different jobs; note that the wall time for B936hi is 
quite large compared to the wall time used to B9361o 
and B936hi, however this is due to the fact that only 12 
CPUs where used to process the 64 chunks generated in 
the splitting step. 

We also present the scaling in figure [17] were we show 
the required CPU time for all three analyses normalized 
to B9361o. Note that the expected TV log A scaling is 
tracked quite remarkably. It should be noted though 
that the large box size represents the ideal situation for 
Ahf, the scaling will be not as good for smaller box sizes. 



6. SUMMARY & CONCLUSIONS 

We have introduced and described the halo finding 
tool Ahf, which is freely provided for use to the com- 

10 Note that this means that a team of 12 analysis tasks worked 
on the 64 chunks produced in the splitting step. 
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TABLE 3 

Timing results for the analyses of set 2. 



Name 




T™*[h] 


nCPUan 




nCPU S p 


B9361o 


0.40 


0.10 


4 


1.42 


4 


B936mc 


3.74 


0.33 


16 


3.25 


16 


B936hi 


39.42 


3.04 


12 


5.73 


64 



Note. — The subscript 'an' refers to quantities related to the 
actual analysis, where the subscript 'sp' labels quantities related 
to the splitting of the simulation volume. The superscript 'CPU' 
labels the required CPU time, whereas 'Wall' notes the wall clock 
time. Note the times for the analysis process are given in hours, 
whereas the times for the splitting are in minutes. 



10 12 14 16 18 20 2 4 6 8 10 12 14 16 18 20 2 4 6 8 10 12 14 
Number of CPUs Number of CPUs Number of CPUs 

Fig. 16. — Here we show the achieved memory reduction as defined by equation[8]of the parallel runs. For guidance, the curves fo r lin ear 
scaling are shown as a dashed (gray) curve and additionally, we show the expected ideal memory scaling with a solid gray line (see i]5.2l for 
more details). 

and can deal with large simulation data. Furthermore 
the adequate inclusion of star and gas particles is sup- 
ported as well. Ahf requires only very few parameters, 
in fact, only five (three) parameters need are to be spec- 
ified for the parallel (serial) version. We have shown in 
£13.31 that the size of the domain grid (DomGrid) and the 
refinement criterion on the domain grid (DomRef) have 
only a very marginal impact on the results, reducing the 
number of relevant parameters to 3 (1). 

However, the refinement criterion on the refined grids 
(RefRef) does influence the completeness of the derived 
halo catalog. We discussed this in detail in M3.4I find- 
ing that to be complete for halos containing less than 50 
particles, a refinement criterion of < 3.0 must be cho- 
sen. However, as we show in figure [HI this increases the 
memory requirement by ~ 40 per cent and the runtime 
by a factor of 2 — 3. This will be mostly interesting for 
analyses of snapshots at high redshift and low particle 
resolution. 

Furthermore we have shown in £j3]that the number of 
CPUs involved in the analysis only has no impact on the 
derived properties when the LB parameter (i j3.5[) is cho- 
sen carefully. However, given a suitable choice of the LB 
parameter a reasonable scaling is achieved ( §5.2|) . This 
is especially important for the memory scaling, as this is 
the key allowing for the analysis of billion-particle simu- 
lations. In fact he have shown in ^15.31 that, given a good 
choice of parameters, Ahf scales very well with increas- 
ing particle resolution. We have shown this explicitly for 
a 1024 3 simulation in this paper and also have success- 
fully employed Ahf pr eviously on 512 3 simulations (e.g. 
iKnollmann et aLlfeOOSl ). 

To summarize the choice of recommended parameters: 

• We suggest to use a DomGrid of 64, 

• a DomRef of 5.0 and 

• a RefRef of 5.0 to achieve trustworthy results for 
halos made up by more than 50 particles. 

• To achieve untainted results in the parallel version, 
the LB level should be chosen in such a way that 
the relation given in equation [3] holds. 

• The number of CPUs should be chosen as small 
as possible, the limiting factor here is the available 
memory. 
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Fig. 17. — The relative CPU time as reported by the qucucing 
system is shown depending on the relative simulation size. The 
data points are normalized to the 256 3 particle representation and 
we show a linear scaling with a dashed gray line and an N log N 
scaling with a solid line. The data points are labeled with the 
simulation resolution they correspond to. 



muni t y 11 . Ahf nati vely reads Gadget (Springe l et al.l 
IMllSpririgel[2005l) format as well as Amiga binaries 

11 Download at http://www.aip.de/People/AKnebe/AMIGA/ 



AHF: AMIGA'S Halo Finder 



17 



As we have shown in g572J the mismatch in runtime be- 
tween the separate tasks can become significant. We pro- 
vide a way around this by a two-staged approach: First, 
the independent volume chunks including the boundary 
are constructed on all CPUs. Then those will be dumped 
to disk and can be analysed in serial (or multiple at 
once, as the chunks are completely independent). This 
approach becomes important for inhomogeneous matter 
distributions (small box sizes) and large number of par- 
ticles (> 512 3 ). This feature is available in the public 
version, for details contact the authors. 
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APPENDIX 
AHF'S UNBINDING PROCEDURE 

In order to remove gravitationally unbound particles we have to obtain the (local) escape velocity v CS c(r) at each 
particle position. As v csc is directly related to the (local) value of the potential, v csc — y/2 \4>\, we integrate Poisson's 
equation (under the assumption of spherical symmetry) : 



r dr V dr I 



(Al) 



The first integral reads as follows 



" 2#~ 


= 4irG f 


dr 


r=0 JO 



p{r')r' 2 dr' = GM(< r) 



This equation shows that d<fi/dr oc M(< r)/r 2 and hence r 2 dcf>/dr — ► for r 
following first-order differential equation for (f>(r): 



Another integration leaves us with 



d(j) _ G M(< r) 
dr r 2 



This time we need to calculate 4>o- We do this by requiring <fi(oo) = 



M{< r') 



dr' 



G 



G 



M«r') dr , 



r-l- 



M«r') dr , 



GM vir 
M vir 



: dr' 



r ./2 



G 



and hence 



t>o = - [G 



(A2) 

0. We are therefore left with the 

(A3) 

(A4) 
(A5) 

h (A6) 
»o (A7) 

(A8) 

(A9) 



Note that we assume that the halo is truncated at r V j r when evaluating the integral ^ dr'. An initial guess 

for particles belonging to a (sub-)halo in a cosmological simulation is based upon the adaptive mesh hierarchy of the 
simulation code Amiga, as we have described in §2.11 Unbound particles are removed iteratively where we integrate 
equation I A4I along a list of radially ordered particles; the same holds for obtaining <fio that has to be re-evaluated for 
every new iteration. 
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